A natural framework for sparse hierarchical clustering

نویسندگان

  • Hongyang Zhang
  • Ruben H. Zamar
چکیده

There has been a surge in the number of large and flat data sets – data sets containing a large number of features and a relatively small number of observations – due to the growing ability to collect and store information in medical research and other fields. Hierarchical clustering is a widely used clustering tool. In hierarchical clustering, large and flat data sets may allow for a better coverage of clustering features (features that help explain the true underlying clusters) but, such data sets usually include a large fraction of noise features (non-clustering features) that may hide the underlying clusters. Witten and Tibshirani (2010) proposed a sparse hierarchical clustering framework to cluster the observations using an adaptively chosen subset of the features, however, we show that this framework has some limitations when the data sets contain clustering features with complex structure. In this paper, another sparse hierarchical clustering (SHC) framework is proposed. We show that, using simulation studies and real data examples, the proposed framework produces superior feature selection and clustering performance comparing to the classical (of-the-shelf) hierarchical clustering and the existing sparse hierarchical clustering framework.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Multi-rank Sparse Hierarchical Clustering

There has been a surge in the number of large and flat data sets – data sets containing a large number of features and a relatively small number of observations – due to the growing ability to collect and store information in medical research and other fields. Hierarchical clustering is a widely used clustering tool. In hierarchical clustering, large and flat data sets may allow for a better co...

متن کامل

A Probabilistic Hierarchical Clustering Method for Organising Collections of Text Documents

In this paper a generic probabilistic framework for the unsupervised hierarchical clustering of large-scale sparse high-dimensional data collections is proposed. The framework is based on a hierarchical probabilistic mixture methodology. Two classes of models emerge from the analysis and these have been termed as symmetric and asymmetric models. For text data specifically both asymmetric and sy...

متن کامل

Probabilistic Hierarchical Clustering Method for Organizing Collections of Text Documents

In this paper a generic probabilistic framework for the unsupervised hierarchical clustering of large-scale sparse high-dimensional data collections is proposed. The framework is based on a hierarchical probabilistic mixture methodology. Two classes of models emerge from the analysis and these have been termed as symmetric and asymmetric models. For text data specifically both asymmetric and sy...

متن کامل

A framework for feature selection in clustering.

We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observat...

متن کامل

Analyzing Motorcycle Crash Pattern and Riders’ Fault Status at a National Level: A Case Study from Iran

Motorcycle crashes constitute a significant proportion of traffic accidents all over the world. The aim of this paper was to examine the motorcycle crash patterns and rider fault status across the provinces of Iran. For this purpose, 6638 motorcycle crashes occurred in Iran through 2009-2012 were used as the analysis data and a two-step clustering approach was adopted as the analysis framework....

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1409.0745  شماره 

صفحات  -

تاریخ انتشار 2014